 scikit-learn pipeline


Data Science Quick Tip #003: Using Scikit-Learn Pipelines!

#artificialintelligence

We're back this week with another data science quick tip, and this one is a two-parter. In this first part, we'll cover how to use Scikit-Learn pipelines with Scikit-Learn's barebones transformers, and in the next part, I'll teach you how to use your own custom data transformers within this same pipeline framework. Before getting into things, let me share my GitHub for this post in case you want to follow along more closely. I've also included the data we'll be working with. Check it all out at this link.
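As a rough illustration of what chaining built-in transformers in a pipeline looks like, here is a minimal sketch. It is not the code from the post: the synthetic dataset, the step names, and the choice of `SimpleImputer`, `StandardScaler`, and `LogisticRegression` are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data stands in for the post's dataset, which is not shown here.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each step is a (name, estimator) pair; the names are arbitrary labels.
# Every step except the last must be a transformer; the last is the model.
pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
])

# fit() fits each transformer on the training data in order, then the model.
pipe.fit(X_train, y_train)

# predict() and score() push the test data through the already-fitted
# transformers before calling the model, so no statistics leak from test data.
predictions = pipe.predict(X_test)
accuracy = pipe.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.2f}")
```

Because the whole chain behaves like a single estimator, the same `pipe` object can be passed to cross-validation or grid search utilities without repeating the preprocessing by hand.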


Yet Another Library for Deep Learning You Should Know About

#artificialintelligence

It has many algorithms, supports sparse datasets, is fast and has many utility functions, like cross-validation, grid search, etc. When it comes to advanced modeling, scikit-learn often falls short. If you need Boosting, Neural Networks or t-SNE, it's better to avoid scikit-learn. While MLPClassifier and MLPRegressor have a rich set of arguments, there's no option to customize layers of a Neural Network (beyond setting the number of hidden units for each layer) and there's no GPU support. While there are already superior libraries available like PyTorch or Tensorflow, scikit-neuralnetwork may be a good choice for those coming from a scikit-learn ecosystem.


Deep Learning with scikit-learn

#artificialintelligence

It has a good set of algorithms, supports sparse datasets, is fast and has many utility functions, like cross-validation, grid search, etc. When it comes to advanced modeling, scikit-learn often falls short. If you need Boosting, Neural Networks or t-SNE, it is better to avoid scikit-learn. There is MLPClassifier for classification and MLPRegressor for regression. While both have a rich set of arguments, there isn't an option to customize layers of a Neural Network (beyond setting the number of hidden units for each layer).
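To make the limitation concrete, here is a small sketch of how layer sizes are set on `MLPClassifier`. The dataset and the particular sizes `(16, 8)` are assumptions chosen for illustration; the point is that `hidden_layer_sizes` controls the width of each hidden layer, while the layer type itself is not configurable.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic binary classification data with 4 input features (illustrative only).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Two hidden layers with 16 and 8 units. The tuple sets only the number of
# units per layer; every layer is fully connected and shares one activation.
clf = MLPClassifier(
    hidden_layer_sizes=(16, 8),
    activation="relu",
    max_iter=500,
    random_state=0,
)
clf.fit(X, y)

# One weight matrix per connection between consecutive layers:
# input -> hidden 1, hidden 1 -> hidden 2, hidden 2 -> output.
for i, weights in enumerate(clf.coefs_):
    print(f"weight matrix {i}: {weights.shape}")
```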


Supercharging Hyperparameter Tuning with Dask

#artificialintelligence

Hyperparameter tuning is a crucial, and often painful, part of building machine learning models. Squeezing out each bit of performance from your model may mean a difference of millions of dollars in ad revenue, or life-and-death for patients in healthcare models. Even if your model takes one minute to train, you can end up waiting hours for a grid search to complete (think a 10x10 grid, cross-validation, etc.). Each time you wait for a search to finish breaks an iteration cycle and increases the time it takes to produce value with your model. In this post, we will show how you can improve the speed of your hyperparameter search by over 100x by replacing a few lines of your scikit-learn pipeline with Dask code on Saturn Cloud.
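The "waiting hours" estimate is easy to check with back-of-the-envelope arithmetic. The sketch below assumes 5-fold cross-validation and strictly sequential fits; neither figure comes from the post, which only mentions a 10x10 grid and a one-minute training time.

```python
from itertools import product

# Hypothetical 10x10 grid: ten candidate values for each of two hyperparameters.
param_a = range(10)
param_b = range(10)

cv_folds = 5         # assumed; the post does not state a fold count
minutes_per_fit = 1  # "one minute to train", from the text

# A grid search fits the model once per (candidate, fold) pair.
n_candidates = len(list(product(param_a, param_b)))
total_fits = n_candidates * cv_folds
total_hours = total_fits * minutes_per_fit / 60

print(f"{n_candidates} candidates x {cv_folds} folds = {total_fits} fits")
print(f"about {total_hours:.1f} hours when run sequentially")
```

Under these assumptions the search needs 500 fits, or a little over eight hours, which is why spreading the fits across many workers pays off so quickly.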


Azure.Source - Volume 68

#artificialintelligence

Scale out read-heavy workloads on Azure Database for PostgreSQL with read replicas, which enable continuous, asynchronous replication of data from one Azure Database for PostgreSQL master server to up to five Azure Database for PostgreSQL read replica servers in the same region. Replica servers are read-only except for writes replicated from data changes on the master. Stopping replication to a replica server causes it to become a standalone server that accepts reads and writes. Replicas are new servers that can be managed in much the same way as normal standalone Azure Database for PostgreSQL servers. For each read replica, you are billed for the provisioned compute in vCores and provisioned storage in GB/month.


Microsoft joins the SciKit-learn Consortium

#artificialintelligence

As part of our ongoing commitment to open and interoperable artificial intelligence, Microsoft has joined the SciKit-learn consortium as a platinum member and released tools to enable increased usage of SciKit-learn pipelines. Initially launched in 2007 by members of the Python scientific community, SciKit-learn has attracted a large community of active developers who have turned it into a first class, open source library used by many companies and individuals around the world for scenarios ranging from fraud detection to process optimization. Following SciKit-learn's remarkable success, the SciKit-learn consortium was launched in September 2018 by Inria, the French national institute for research in computer science, to foster growth and sustainability of the library, employing central contributors to maintain high standards and develop new features. We are extremely supportive of what the SciKit-learn community has accomplished so far and want to see it continue to thrive and expand. By joining the newly formed SciKit-learn consortium, we will support central contributors to ensure that SciKit-learn remains a high-quality project while also tackling new features in conjunction with the fabulous community of users and developers.


Managing Machine Learning Workflows with Scikit-learn Pipelines Part 1: A Gentle Introduction

#artificialintelligence

Are you familiar with Scikit-learn Pipelines? They are an extremely simple yet very useful tool for managing machine learning workflows. A typical machine learning task generally involves data preparation to varying degrees. We won't get into the wide array of activities which make up data preparation here, but there are many. Such tasks are known for taking up a large proportion of time spent on any given machine learning task.


The Beginner's Guide to Text Vectorization (MonkeyLearn Blog)

@machinelearnbot

Since the beginning of the brief history of Natural Language Processing (NLP), there has been the need to transform text into something a machine can understand. That is, transforming text into a meaningful vector (or array) of numbers. The de-facto standard way of doing this in the pre-deep learning era was to use a bag of words approach. The idea behind this method is very simple, though very powerful. First, we define a fixed length vector where each entry corresponds to a word in our pre-defined dictionary of words.
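The fixed-length vector described above can be sketched in a few lines of plain Python. The vocabulary, the sample sentence, and the `bag_of_words` helper are all hypothetical, and the excerpt does not say how each entry is filled in; this version assumes the common choice of storing word counts.

```python
from collections import Counter

# Hypothetical pre-defined dictionary: one vector position per word.
vocabulary = ["data", "learning", "machine", "pipeline", "text"]


def bag_of_words(document, vocabulary):
    """Return a fixed-length count vector for `document`.

    Entry i holds how many times vocabulary[i] occurs in the document.
    Words outside the vocabulary are ignored, and word order is discarded.
    """
    # Naive tokenization: lowercase and split on whitespace.
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocabulary]


sentence = "Machine learning turns text into data and text into vectors"
vector = bag_of_words(sentence, vocabulary)
print(vector)  # [1, 1, 1, 0, 2]
```

The vector always has the same length as the vocabulary, regardless of how long the input text is, which is what lets downstream models consume it.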